Chinese CCGbank: extracting CCG derivations from the Penn Chinese Treebank
نویسندگان
چکیده
Automated conversion has allowed the development of wide-coverage corpora for a variety of grammar formalisms without the expense of manual annotation. Analysing new languages also tests formalisms, exposing their strengths and weaknesses. We present Chinese CCGbank, a 760,000 word corpus annotated with Combinatory Categorial Grammar (CCG) derivations, induced automatically from the Penn Chinese Treebank (PCTB). We design parsimonious CCG analyses for a range of Chinese syntactic constructions, and transform the PCTB trees to produce them. Our process yields a corpus of 27,759 derivations, covering 98.1% of the PCTB.
منابع مشابه
CCGbank: A Corpus of CCG Derivations and Dependency Structures Extracted from the Penn Treebank
This article presents an algorithm for translating the Penn Treebank into a corpus of Combinatory Categorial Grammar (CCG) derivations augmented with local and long-range word–word dependencies. The resulting corpus,CCGbank,includes 99.4% of the sentences in the Penn Treebank. It is available from the Linguistic Data Consortium,and has been used to train widecoverage statistical parsers that ob...
متن کاملA Machine Learning Approach to Convert CCGbank to Penn Treebank
Conversion between different grammar frameworks is of great importance to comparative performance analysis of the parsers developed based on them and to discover the essential nature of languages. This paper presents an approach that converts Combinatory Categorial Grammar (CCG) derivations to Penn Treebank (PTB) trees using a maximum entropy model. Compared with previous work, the presented te...
متن کاملExtending CCGbank with Quotes and Multi-modal CCG
CCGbank is an automatic conversion of the Penn Treebank to Combinatory Categorial Grammar (CCG). We present two extensions to CCGbank which involve manipulating its derivation and category structure. We discuss approaches for the automatic re-insertion of removed quote symbols and evaluate their impact on the performance of the C&C CCG parser. We also analyse CCGbank to extract a multi-modal CC...
متن کاملParsing Noun Phrase Structure with CCG
Statistical parsing of noun phrase (NP) structure has been hampered by a lack of goldstandard data. This is a significant problem for CCGbank, where binary branching NP derivations are often incorrect, a result of the automatic conversion from the Penn Treebank. We correct these errors in CCGbank using a gold-standard corpus of NP structure, resulting in a much more accurate corpus. We also imp...
متن کاملHindi CCGbank: CCG Treebank from the Hindi Dependency Treebank
In this paper, we present an approach for automatically creating a Combinatory Categorial Grammar (CCG) treebank from a dependency treebank for the Subject-Object-Verb language Hindi. Rather than a direct conversion from dependency trees to CCG trees, we propose a two stage approach: a language independent generic algorithm first extracts a CCG lexicon from the dependency treebank. A determinis...
متن کامل